Language interoperable runtime adaptable data collections

ABSTRACT

Adaptive data collections may include various types of data arrays, sets, bags, maps, and other data structures. A simple interface for each adaptive collection may provide access via a unified API to adaptive implementations of the collection. A single adaptive data collection may include multiple, different adaptive implementations. A system configured to implement adaptive data collections may include the ability to adaptively select between various implementations, either manually or automatically, and to map a given workload to differing hardware configurations. Additionally, hardware resource needs of different configurations may be predicted from a small number of workload measurements. Adaptive data collections may provide language interoperability, such as by leveraging runtime compilation to build adaptive data collections and to compile and optimize implementation code and user code together. Adaptive data collections may also provide language-independent access such that implementation code may be written once and subsequently used from multiple programming languages.

This application is a continuation of U.S. patent application Ser. No. 16/165,593 (now issued as U.S. Pat. No. 10,803,087), filed Oct. 19, 2018, which is hereby incorporated by reference herein in its entirety.

BACKGROUND

Field of the Disclosure

This disclosure relates to placement of memory, threads and data within multi-core systems, such as data analytics computers with multiple sockets per machine, multiple cores per socket, and/or multiple thread contexts per core.

DESCRIPTION OF THE RELATED ART

Modern computer systems, such as those used for data analytics, are often systems with multiple sockets per machine, multiple cores per socket and multiple thread contexts per core. Obtaining high performance from these systems frequently requires the correct placement of data to be accessed within the machine. There have been increasing demands on systems to efficiently support big data processing, such as database management systems and graph processing systems, while attempting to store and process data in-memory.

However, traditional implementations of big-data analytics frameworks are generally slow and frequently involve recurring issues such as costly transfers of data between disk and main memory, inefficient data representations during processing, and excessive garbage collection activity in managed languages. Additionally, analytics workloads may be increasingly limited by simple bottlenecks within the machine, such as due to saturating the data transfer rate between processors and memory, saturating the interconnect between processors, saturating a core's functional units, etc.

Existing solutions may also exhibit workload dependencies, programming difficulties due to hardware characteristics, as well as programming language dependencies, thereby potentially limiting their usefulness in modern environments that frequently include a diverse range of different programming languages. Similarly, existing solutions require strictly defined interfaces between languages that treat native code as a black box, thus introducing a compilation barrier that can degrade performance.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a logical block diagram illustrating an adaptive data collection implementation on a multi-socket computer system, according to one embodiment.

FIG. 2 is a logical block diagram illustrating one example technique for accessing an adaptive data collection, according to one embodiment.

FIGS. 3a-3d are logical block diagrams illustrating example parallel array aggregation with different smart functionalities, according to one embodiment.

FIG. 4 is a logical block diagram illustrating one example of accessing an adaptive data collection (e.g., an adaptive array), according to one embodiment.

FIG. 5 is a logical diagram illustrating an example of accessing an adaptive data collection that includes multiple interfaces and functionalities.

FIG. 6a is a logical diagram illustrating replication of an adaptive data collection across sockets in one embodiment.

FIG. 6b is a logical diagram illustrating bit compressions of an adaptive data collection, according to one embodiment.

FIG. 7 is a logical diagram illustrating an example software model for implementing adaptive data collections, according to one embodiment.

FIG. 8 is a flowchart illustrating one embodiment of a method for placement configuration selection, as described herein.

FIG. 9 is a flowchart illustrating one method for uncompressed placement candidate selection according to one example embodiment.

FIG. 10 is a flowchart illustrating one method for compressed placement candidate selection according to one example embodiment.

FIG. 11 is a logical block diagram illustrating one example adaptive data collection accessible via the Java programming language, according to one embodiment.

FIG. 12 is a logical block diagram illustrating one embodiment of a computing system that is configured to implement the methods, mechanisms and/or techniques described herein.

In the following detailed description, numerous specific details are set forth to provide a thorough understanding of claimed subject matter. However, it will be understood by those skilled in the art that claimed subject matter may be practiced without these specific details. In other instances, methods, apparatuses or systems are not described in detail below because they are known by one of ordinary skill in the art in order not to obscure claimed subject matter.

While various embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit the embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the disclosure. Any headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to.

Some portions of the detailed description which follow are presented in terms of algorithms or symbolic representations of operations on binary digital signals stored within a memory of a specific apparatus or special purpose computing device or platform. In the context of this particular specification, the term specific apparatus or the like includes a general-purpose computer once it is programmed to perform particular functions pursuant to instructions from program software. Algorithmic descriptions or symbolic representations are examples of techniques used by those of ordinary skill in the signal processing or related arts to convey the substance of their work to others skilled in the art. An algorithm is here, and is generally, considered to be a self-consistent sequence of operations or similar signal processing leading to a desired result. In this context, operations or processing involve physical manipulation of physical quantities. Typically, although not necessarily, such quantities may take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared or otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to such signals as bits, data, values, elements, symbols, characters, terms, numbers, numerals or the like. It should be understood, however, that all of these or similar terms are to be associated with appropriate physical quantities and are merely convenient labels.

Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining” or the like refer to actions or processes of a specific apparatus, such as a special purpose computer or a similar special purpose electronic computing device. In the context of this specification, therefore, a special purpose computer or a similar special purpose electronic computing device is capable of manipulating or transforming signals, typically represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the special purpose computer or similar special purpose electronic computing device.

SUMMARY

Described herein are systems, methods, mechanisms and/or techniques for implementing language interoperable runtime adaptive data collections. Adaptive data collections may include various types of data arrays, sets, bags, maps, and other data structures. For each adaptive data collection, there may be a simple interface providing access via a unified application programming interface (API). Language interoperable runtime adaptive data collections, which may be referred to herein as simply “adaptive data collections” or “smart collections” (e.g., arrays, sets, bags, maps, etc.), may provide different adaptive (or smart) implementations and/or data functionalities of the same adaptive data collection interface. For example, various adaptive data functionalities may be developed for various data layouts, such as different Non-Uniform Memory Access (NUMA) aware data placements, different compression schemes (e.g., compression of data within a collection), different indexing schemes within a collection, different data synchronization schemes, etc.

A system configured to implement adaptive data collections may include the ability to adaptively select between various data functionalities, either manually or automatically, and to map a given workload to different hardware configurations (e.g., different resource characteristics). Various configurations specifying different data functionalities may be selected during an initial data collection configuration as well as dynamically during runtime, such as due to changing execution characteristics or resource characteristics (e.g., of the workload). Described herein are algorithms for dynamically adapting data functionalities (e.g., smart functionalities) to a given system and workload, according to various embodiments.

As described herein, adaptive data collections may provide language interoperability, such as by leveraging runtime compilation to build adaptive data collections as well as to efficiently compile and optimize data functionality code (e.g., smart functionalities) and the user code together. For example, in one embodiment a system configured to implement the methods, mechanisms and/or techniques described herein may implement adaptive NUMA-aware (Non-Uniform Memory Access) data placement and/or bit compression for data collections in a language-independent manner through runtime compilation. Adaptive data collections may also provide language-independent access to content and data functionalities, such that optimization code may be written once and subsequently reused via (e.g., accessed from) multiple programming languages. For example, according to one embodiment adaptive data collections implemented in C++ may be accessed from workloads written in C++ or other languages, such as Java, via runtime compilation.

Additionally, in some embodiments adaptive data collections may improve parallelism and scheduling, such as by integrating adaptive data collections with a runtime system in order to conveniently provide fine-grained parallelism and scheduling for the workloads that use adaptive data collections. Adaptive data collections may also provide adaptivity, such as by utilizing an adaptivity workflow that can predict hardware resource needs of different configurations (e.g., NUMA-aware data placement, bit compression, etc.) from a small number of workload measurements. Such adaptivity may provide a one-time adaptation or ongoing runtime adaptivity, according to various embodiments. For example, a system implementing adaptable data collections may select a particular adaptivity configuration for the data functionalities (e.g., the best configuration based on certain criteria) based at least in part on one or more predicted resource requirements (e.g., of a workload). A configuration may specify one or more data functionalities for a given data collection. For example, a configuration may specify a data placement functionality, such as a particular NUMA-aware data placement scheme, a compression algorithm for compressing data of the data collection, an element indexing scheme for the data collection, etc.

Systems configured to implement adaptive data collections as described herein may improve their performance (e.g., by exploiting adaptive data functionalities, such as NUMA-aware data placement, data compression, etc.), reduce hardware resource requirements (e.g., reducing main memory requirements through data compression, etc.) and/or simplify programming across multiple languages, thereby potentially reducing development and maintenance costs, according to various embodiments. In some embodiments, adaptive data collections, such as adaptive arrays, may be integrated into a runtime system, such as to provide fine-grained efficient parallelism and scheduling to the workloads that access the adaptive data collections.

DETAILED DESCRIPTION OF EMBODIMENTS

As noted above, systems, methods, mechanisms and/or techniques described herein may, in some embodiments, implement a system for implementing adaptive data collections. An array may be considered one of the most prominent types of in-memory data collection or in-memory data structure. Various systems, methods, mechanisms and/or techniques for implementing adaptive data collections are described herein mainly in terms of arrays (e.g., adaptive or smart arrays). However, the systems, methods, mechanisms and/or techniques described herein may be applied to any suitable data collections or data structures. For example, adaptive data collections may include arrays, sets, bags, maps and/or other data structures, according to various embodiments. For each adaptive data collection, there may be a simple interface to access the collection via a unified API. For example, a map may have an interface to access the keys and associated values. Additionally, an adaptive data collection may have multiple, different data layout implementations in some embodiments.

FIG. 1 is a logical block diagram illustrating an adaptive data collection implementation on a multi-socket computer system, according to one embodiment. While FIG. 1 illustrates a system according to one example embodiment, the methods, mechanisms and/or techniques for implementing adaptive data collections described herein may be applicable to any application/system that uses data collections for storing and processing data. For example, adaptive data collections may be implemented on database management and graph processing systems, thereby potentially allowing those systems to utilize language-independent, and/or runtime adaptive, optimizations for NUMA-awareness and bit compression, according to various embodiments. A system, such as multi-socket computer system 100, configured to implement language interoperable runtime adaptable data collections may include (i.e., provide, execute, etc.) a platform-independent virtual environment 150, such as a Java-based virtual machine (VM) in one example embodiment, within which various other software components may execute. For example, one or more applications 140 may execute within (or be executed by) the virtual environment 150 and may be configured to utilize and/or access one or more adaptive data collections 110. A runtime cross-language compiler 130 may be configured to optimize and/or compile (e.g., at runtime) both application (e.g., user) code and adaptive data collection code, such as one or more language specific interface(s) 120, according to various embodiments.

In order to support language interoperability efficiently and seamlessly for multi-language workloads that use adaptive data collections, the code of the adaptive data collection may be tailored to compile with runtime cross-language compiler 130. Additionally, adaptive data collections may be implemented on (or using) a platform-independent virtual environment 150 configured for compiling and running multi-language applications. In some embodiments platform-independent virtual environment 150 may include, or be configured to be, a language interpreter (i.e., of abstract syntax trees, bytecode, etc.) that may use the runtime cross-language compiler 130 to dynamically compile guest language applications to machine code. For example, the system may, according to one embodiment, include language implementations for any of various languages, such as C/C++, JavaScript, Python, R, and Ruby, as a few examples. Thus in one example embodiment, adaptive data collections may be implemented using the Graal™ virtual machine (GraalVM) based on the Java HotSpot virtual machine (VM) including the Truffle™ language interpreter as well as Sulong™ (i.e., a Truffle implementation of LLVM bitcode).

Additionally, the system may also include an implementation of low level virtual machine (LLVM) bitcode, according to one embodiment. An LLVM may utilize one or more front end compilers to compile source languages to LLVM bitcode that may be interpreted on the platform-independent virtual environment (e.g., the VM).

In some embodiments, multi-socket computer system 100 may comprise one or more interconnected sockets of multi-core processors. Memory within the system 100 may be decentralized and attached to each socket in a cache-coherent non-uniform memory access (ccNUMA) architecture. FIG. 2 is a logical diagram illustrating an example machine with two sockets 200 and 205, each containing an 8-core CPU 210 with two hyper-threads per core, according to one example embodiment. Each core may have an L1-I/D 215 and L2 cache 220. Each CPU has a shared L3 last-level cache 225. Each socket may have four 32 GB DIMMs 235 as well as a memory controller 230, and the two sockets may be connected with an interconnect 240, such as Intel QuickPath Interconnect (QPI) 16 GB/s links in one example embodiment.

Although NUMA topologies may vary, such as by the number of sockets, processors, memory, interconnects, etc., there may be a few common fundamental performance characteristics: remote memory accesses may be slower than local accesses, the bandwidth to a socket's memory and the bandwidth of its interconnect may be separately saturated, and the bandwidth of an interconnect is often much lower than a socket's local memory bandwidth. Thus, in some embodiments performance-critical applications may need to be NUMA-aware by using OS facilities to control the placement of data and of threads. For example, in one example operating system, the default data placement policy may be to physically allocate a virtual memory page on the particular socket on which the thread that first touches the memory is running (e.g., potentially after raising a page-fault). Other policies may include explicitly pinning pages on sockets and interleaving pages in a round-robin fashion across sockets.

In some embodiments, adaptive data collections may be implemented on a C++ runtime system, such as the Callisto runtime system (RTS) in one example embodiment, that supports parallel loops with dynamic distribution of loop iterations between worker threads. For example, in some embodiments, platform-independent virtual environment 150 may be (or may include, or may be part of) such a runtime system. Utilizing an RTS supporting parallel loops with dynamic distribution of loop iterations between worker threads may in some embodiments provide a programming model similar to dynamically scheduled loops except that the work distribution techniques may permit a more finely grained and scalable distribution of work (e.g., even on an 8-socket machine with 1024 hardware threads). In some embodiments, adaptive data collections may be implemented using a library (e.g., a Java library) to express loops. For example, the loop body may be written as a lambda function and each loop may execute over a pool of Java worker threads making calls from Java to C++ each time the worker requires a new batch of loop iterations, with the fast-path distribution of work between threads occurring in C++. For example, Java Native Interface (JNI) calls may be used to interface between Java and C++. Additionally, in some embodiments the use of JNI may be designed to pass only scalar values, thus potentially avoiding typically costly cases.
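The dynamic distribution of loop iterations described above can be illustrated with a minimal C++ sketch. It is not the Callisto RTS interface: the names parallel_for and parallel_sum, the batch size of 1024, and the use of four workers are all assumptions made for illustration. Each worker claims the next batch of iterations from a shared atomic counter, so faster workers naturally take more batches.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper (not an API from the source): runs body(i) for every
// i in [0, n). Worker threads claim batches from a shared atomic counter.
template <typename Body>
void parallel_for(std::size_t n, std::size_t batch, unsigned workers, Body body) {
    assert(batch > 0 && workers > 0);
    std::atomic<std::size_t> next{0};
    auto worker = [&]() {
        for (;;) {
            // fetch_add hands each worker a distinct starting index.
            std::size_t begin = next.fetch_add(batch);
            if (begin >= n) break;
            std::size_t end = std::min(n, begin + batch);
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}

// Example use: the loop body is a lambda, as in the programming model above.
// Batch size and worker count are arbitrary illustrative values.
std::uint64_t parallel_sum(const std::vector<std::uint64_t>& data) {
    std::atomic<std::uint64_t> total{0};
    parallel_for(data.size(), 1024, 4, [&](std::size_t i) {
        total.fetch_add(data[i], std::memory_order_relaxed);
    });
    return total.load();
}
```

A production runtime would likely keep per-thread partial sums and use a more scalable work distribution than a single shared counter; the sketch only shows the batch-claiming idea.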

The implementation of adaptive data collections, as described herein, may support additional data functionalities (e.g., smart functionalities) to express resource trade-offs, such as multiple data placement options within a NUMA machine and bit compression of the collection's content. Additionally, in some embodiments, adaptive data collections may support randomization, such as a fine-grained index-remapping of a collection's elements. This kind of permutation may, in some embodiments, ensure that “hot” nearby data items are mapped to storage on different locations served by different memory channels, thus potentially reducing hot-spots in the memory systems if one memory channel becomes saturated before others. In some embodiments, data placement techniques may be extended with partitioning data across the available threads based on domain specific knowledge. Moreover, alternative compression techniques may be utilized with adaptive data collections that may achieve higher compression rates on different categories of data, such as dictionary encoding, run-length encoding, etc. Furthermore, in some embodiments, adaptive data collections may include synchronization support and/or data synchronization schemes, such as to support both read-based and write-based concurrent workloads.
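As one possible illustration of the index-remapping mentioned above, the following C++ sketch permutes element indices so that neighboring logical indices land far apart in storage. The text does not specify a remapping function; the function name remap_index, the power-of-two size restriction, and the choice of multiplier are assumptions made only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical remapping (an assumption, not the scheme from the source).
// For a collection whose size is a power of two, multiplying the index by an
// odd constant modulo the size is a bijection, so every logical index maps
// to a distinct storage slot.
std::uint64_t remap_index(std::uint64_t index, std::uint64_t size) {
    assert(size != 0 && (size & (size - 1)) == 0);  // size must be a power of two
    assert(index < size);
    const std::uint64_t odd_multiplier = 0x9E3779B97F4A7C15ULL;  // arbitrary odd constant
    return (index * odd_multiplier) & (size - 1);
}
```

Whether such scattering helps depends on how the memory system maps addresses to channels, and it trades away spatial locality for sequential scans.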

Similar to smart data functionalities, different data layouts may support different trade-offs between the use of hardware resources and performance. For example, in some embodiments, adaptive arrays may be used to implement data layouts for sets, bags, and maps, such as by encoding binary trees into arrays, where accessing individual elements may require up to log₂n non-local accesses (where n is the size of the collection). To trade size against performance, hashing may be used in some embodiments instead of trees to index the adaptive arrays.
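A minimal C++ sketch of encoding a binary search tree into an array is shown below. It assumes one common implicit layout (the children of node i are stored at 2i + 1 and 2i + 2) and distinct keys; the source does not state which encoding is used, and the names build, make_tree, and contains are hypothetical. A lookup follows one root-to-leaf path, so it touches a number of array elements that grows logarithmically with n.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fills `tree` from `sorted` by an in-order walk over the implicit tree, so
// that the array satisfies the binary-search-tree property.
void build(const std::vector<std::int64_t>& sorted, std::vector<std::int64_t>& tree,
           std::size_t node, std::size_t& next) {
    if (node >= tree.size()) return;
    build(sorted, tree, 2 * node + 1, next);  // left subtree gets smaller keys
    tree[node] = sorted[next++];
    build(sorted, tree, 2 * node + 2, next);  // right subtree gets larger keys
}

// Builds the array-encoded tree from keys given in ascending order.
std::vector<std::int64_t> make_tree(const std::vector<std::int64_t>& sorted) {
    std::vector<std::int64_t> tree(sorted.size());
    std::size_t next = 0;
    build(sorted, tree, 0, next);
    return tree;
}

// Set-style membership test: descends from the root, one element per level.
bool contains(const std::vector<std::int64_t>& tree, std::int64_t key) {
    std::size_t node = 0;
    while (node < tree.size()) {
        if (tree[node] == key) return true;
        node = 2 * node + (key < tree[node] ? 1 : 2);
    }
    return false;
}
```

For the sorted keys 1, 3, 5, 7, 9, 11 this layout stores the array 7, 3, 11, 1, 5, 9, and a lookup visits at most three elements.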

FIGS. 3a-3d are logical block diagrams illustrating examples of parallel array aggregation with different smart data functionalities. FIGS. 3a-3d illustrate smart data functionalities for a parallel summation of an array on a 2-socket NUMA machine, according to one example embodiment. When the array 310 is placed on a single socket 300 with accesses 315 coming from threads on both sockets 300, 305, as illustrated in FIG. 3a, the bottleneck may be the socket's memory bandwidth. When the array is interleaved 320, 325 across the machine's sockets, as illustrated in FIG. 3b (showing the same two sockets), both sockets' memory bandwidth may be used to decrease the execution time, and the bottleneck may be the interconnect 330. If the array is replicated 340, 345 across sockets using more memory space, as illustrated in FIG. 3c, memory access may be localized and the interconnect may be removed as a bottleneck to further decrease the execution time. Finally, memory bandwidth may be used more productively by compressing 350 and 355 the array's contents, as illustrated in FIG. 3d, such as to pass more elements through the same memory bandwidth and thereby potentially achieving better performance.

Additionally, adaptive data collections may provide language-independent access to their contents and smart data functionalities. For example, an adaptive data collection may be implemented once in one language, such as C++, but may be accessed from workloads written in the same or other languages, such as C++ or Java. FIG. 4 is a logical block diagram illustrating one example of accessing an adaptive data collection (e.g., an adaptive array), according to one embodiment. The underlying data structure, such as single implementation array 480, accessible via thin interface 410 (illustrated by array 470), may be implemented once in one language (e.g., C++) and exposed to different languages via thin per-language wrappers, such as per-language thin interface 410. Various options regarding the implementation, such as what data placement to use and/or whether to use compression, etc., may be determined manually 430, such as by the programmer, or automatically 460 by the system during runtime, such as via adaptivity mechanism 450, according to various embodiments. The connection between thin interface 410 and single implementation 420 may be set (e.g., configured) at compile time and all versions (e.g., instances, embodiments, etc.) of a given data collection may use the same API. Additionally, thin interfaces in different languages may target the same API. Thus, application code may be able to access array 480 via unified API 440 regardless of the particular language(s) with which the application was developed.
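The separation between a single implementation and thin per-language interfaces can be sketched in C++ as follows. Every name here (SmartArray, PlainArray, and the smart_array_* entry points) is hypothetical and chosen for illustration; the source does not define the unified API's actual functions. The abstract class stands in for the unified API, one concrete class stands in for one implementation, and the C-linkage functions show the kind of narrow surface that a thin wrapper in another language could target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical unified API: every implementation of the array exposes it.
class SmartArray {
public:
    virtual ~SmartArray() = default;
    virtual std::size_t size() const = 0;
    virtual std::uint64_t get(std::size_t index) const = 0;
    virtual void set(std::size_t index, std::uint64_t value) = 0;
};

// One possible implementation: plain uncompressed storage. Others (for
// example replicated or bit-compressed) would implement the same interface.
class PlainArray final : public SmartArray {
public:
    explicit PlainArray(std::size_t n) : data_(n, 0) {}
    std::size_t size() const override { return data_.size(); }
    std::uint64_t get(std::size_t i) const override {
        assert(i < data_.size());
        return data_[i];
    }
    void set(std::size_t i, std::uint64_t v) override {
        assert(i < data_.size());
        data_[i] = v;
    }
private:
    std::vector<std::uint64_t> data_;
};

// Hypothetical C-linkage entry points for thin per-language wrappers.
extern "C" {
SmartArray* smart_array_create(std::size_t n) { return new PlainArray(n); }
std::uint64_t smart_array_get(const SmartArray* a, std::size_t i) { return a->get(i); }
void smart_array_set(SmartArray* a, std::size_t i, std::uint64_t v) { a->set(i, v); }
void smart_array_destroy(SmartArray* a) { delete a; }
}
```

Because callers only see the interface, the choice of implementation could be made by the programmer or by an adaptivity mechanism without changing application code.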

While FIG. 4 illustrates an adaptive array and corresponding interface, similar implementations may be used for other types of collections as well. FIG. 5 is a logical diagram illustrating an example of accessing an adaptive data collection that includes multiple interfaces and data functionalities. Rather than illustrating a single array collection type 480, as in FIG. 4, FIG. 5 illustrates multiple collection types accessible via thin interfaces 410 (e.g., array 470 and map 520). Additionally, data layout 540 may include a single data layout 480 for the array collection type, but two different data layouts 560, 570 for the map collection type. According to the example embodiment of FIG. 5, map 520 may be implemented as either linked lists 570 representing buckets or as arrays representing trees, skip maps, etc. The determination of which alternate implementation (e.g., data layout) to use may be hidden from the user (e.g., programmer) in some embodiments, but the user may be able to specify which to use in other embodiments. Note that FIG. 5 illustrates only a few examples of possible implementations according to one example embodiment and that other implementations and/or data layouts may be utilized in other embodiments. Access to the underlying data structures may be provided by adaptive data functionalities 530 via unified API 580, which may provide different access models, such as array 510 and map 520, usable by application code to access the adaptive data collections (e.g., stored in data layout 540 in this example).

Additionally, adaptive data collections may perform aggregations with adaptive data arrays, which may be referred to as smart arrays herein, and which may be relevant to databases, and may also perform a number of graph analytics algorithms, which may be relevant to graph processing systems. Specifically for graph processing, a traditional way to store the graph data may be in compressed sparse row (CSR) format in which each vertex has an ID. Within the CSR, an edge array may concatenate the neighborhood lists of all vertices (e.g., forward edges in case of directed graphs) using vertex IDs, in ascending order. Another array may hold array indices pointing to the beginning of the neighborhood list of the vertices. Two other similar arrays (e.g., r_edge and r_begin) may hold the reverse edges for directed graphs. Additional arrays may be needed to store vertex and edge properties, as well as for some analytics algorithms and their output. Thus, adaptive data collections, such as smart arrays, may be used to replace all these arrays, such as to exploit their adaptive data functionalities for graph analytics, according to some embodiments.
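The CSR layout described above can be made concrete with a short C++ sketch. Plain std::vector storage stands in for the smart arrays, and the names CsrGraph, build_csr, and out_degree are hypothetical. The sketch also assumes the common convention of giving the begin array one extra trailing entry so that the end of the last neighborhood list is available.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Edge = std::pair<std::uint64_t, std::uint64_t>;  // (source, destination)

struct CsrGraph {
    std::vector<std::uint64_t> begin;  // begin[v] .. begin[v + 1] delimit v's neighbors
    std::vector<std::uint64_t> edge;   // concatenated neighborhood lists (vertex IDs)
};

// Builds the forward-edge CSR arrays from an unordered edge list.
CsrGraph build_csr(std::size_t vertices, const std::vector<Edge>& edges) {
    CsrGraph g;
    g.begin.assign(vertices + 1, 0);
    for (const Edge& e : edges) {
        assert(e.first < vertices && e.second < vertices);
        ++g.begin[e.first + 1];  // count the out-degree of each source
    }
    // Prefix sum turns the counts into starting offsets.
    for (std::size_t v = 0; v < vertices; ++v) g.begin[v + 1] += g.begin[v];
    // Scatter destinations into each vertex's slice of the edge array.
    std::vector<std::uint64_t> cursor(g.begin.begin(), g.begin.end() - 1);
    g.edge.resize(edges.size());
    for (const Edge& e : edges) g.edge[cursor[e.first]++] = e.second;
    // Keep each neighborhood list in ascending vertex-ID order.
    for (std::size_t v = 0; v < vertices; ++v) {
        std::sort(g.edge.begin() + g.begin[v], g.edge.begin() + g.begin[v + 1]);
    }
    return g;
}

// Number of forward edges leaving vertex v.
std::uint64_t out_degree(const CsrGraph& g, std::size_t v) {
    return g.begin[v + 1] - g.begin[v];
}
```

The reverse-edge arrays (r_begin and r_edge) could be built the same way by swapping source and destination in each edge.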

Even without exploiting smart data functionalities, the performance achieved from Java workloads may be similar to Java's built-in array types. Additionally, there may be trade-offs involving the consumption of various hardware resources, such as memory bandwidth and space. Programmers may need to choose the specific implementation that fits the target hardware, workload, inputs, and system activity. Moreover, different scenarios may require these trade-offs to be made in different programming languages. Adaptive data collections, as described herein, may aid in solving these problems, according to some embodiments.

Adaptive data collections may support various NUMA-aware placements that need to be adapted to the workload and system, according to various embodiments, such as:

OS default. For NUMA-agnostic applications, or other applications that do not need to specify a data placement, the default OS data placement policy may be used. Depending on how the adaptive data collection is initialized, its physical location may vary from one socket (e.g., if one thread initializes the array) to random distribution across sockets (e.g., if multiple threads initialize the array).

Single socket. In some embodiments, the adaptive data collection's memory pages may be physically allocated on a specified socket. This placement may be beneficial or detrimental depending on the relative bandwidths, as well as the maximum compute capability of the given processors. In some cases the speedup of those threads that are local to the data may outweigh the slowdown of remote threads.

Interleaved. In some embodiments, the adaptive data collection's memory pages may be physically allocated across the sockets, such as in a round-robin fashion. This may be a default option to distribute memory space and local/remote memory accesses across sockets, but in some embodiments, there may be a bandwidth bottleneck on interconnects.

Replicated. One replica of the adaptive data collection (or of the collection's data) may be placed on each socket in some embodiments. A conceptual example is shown in FIG. 6a, which is a logical diagram illustrating replication of an adaptive data collection across sockets in one embodiment. As illustrated in FIG. 6a, array 610 is replicated on both sockets, resulting in replicated adaptive array 620 on socket 1 and replicated adaptive array 630 on socket 2, according to one example embodiment. Note that adaptive array 630 may not be replicated itself, but adaptive array 630 may replicate some or all of its internal state across multiple sockets, according to some embodiments. For example, without replication a single array may hold the array's data internal to the collection, whereas with replication, multiple arrays (e.g., one on each of multiple sockets) may hold the data while a corresponding array holding references to those arrays may be maintained as well. Thus, portions (e.g., most) of the internal state of the collection may be replicated while the reference array may be accessed/queried only occasionally, thereby potentially avoiding a bottleneck. All of these arrays (e.g., both the reference array and the data arrays on individual sockets) may be considered internal to the collection (e.g., the adaptive array) and therefore, even when data is replicated across multiple sockets, only a single adaptive array may be apparent from a user/programmer's point of view. In some embodiments, the placement illustrated by FIG. 6a may be considered the most performant solution for read-only or read-mostly workloads, such as analytics (since each thread may have fast local accesses to a collection's replica), but replication may come at the cost of a larger memory footprint as well as additional initialization time for replicas.
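The three explicit placements listed above differ in which socket's memory serves a given read. The following C++ sketch models only that mapping; it does not call any operating system facility, and the names Placement, serving_socket, and is_remote, along with the page-granular round-robin rule, are assumptions for illustration. The OS default placement is omitted because its outcome depends on which thread first touches each page.

```cpp
#include <cassert>
#include <cstddef>

enum class Placement { SingleSocket, Interleaved, Replicated };

// Returns the socket whose memory serves a read of `page` issued by a thread
// running on `reader_socket`. `home_socket` is only used by SingleSocket.
unsigned serving_socket(Placement placement, std::size_t page, unsigned reader_socket,
                        unsigned sockets, unsigned home_socket) {
    assert(sockets > 0 && reader_socket < sockets && home_socket < sockets);
    switch (placement) {
        case Placement::SingleSocket:
            return home_socket;                           // all pages on one socket
        case Placement::Interleaved:
            return static_cast<unsigned>(page % sockets);  // round-robin by page
        case Placement::Replicated:
            return reader_socket;                         // each reader uses its local replica
    }
    return home_socket;  // not reached; keeps compilers satisfied
}

// True when the read has to cross the interconnect.
bool is_remote(Placement placement, std::size_t page, unsigned reader_socket,
               unsigned sockets, unsigned home_socket) {
    return serving_socket(placement, page, reader_socket, sockets, home_socket) != reader_socket;
}
```

Under this model a replicated collection never produces remote reads, while an interleaved collection on two sockets makes every other page remote for any given reader.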

Bit compression may be considered a light-weight compression technique popular for many analytics workloads, such as column-store database systems as one example. Bit compression may use fewer than 64 bits for storing integers that require fewer bits. By packing the required bits consecutively across 64-bit words, bit compression can pack the same number of integers into a smaller memory space than the one required for storing the uncompressed 64-bit integers. For instance, FIG. 6b is a logical diagram illustrating bit compressions of an adaptive data collection, according to one embodiment. FIG. 6b shows, according to one example embodiment, an example of compressing an array 640 with two elements, 650 and 660, into a bit-compressed array 670 of two elements, 680 and 690, using 33 bits per element. The number of bits used per element may be the minimum number of bits required to store the largest element in the array in some embodiments.
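A simplified C++ sketch of this packing is given below. It is an illustration under stated assumptions rather than the implementation from the source: the function names bits_needed, pack, and unpack are hypothetical, the width is a runtime argument, elements are packed without any chunking, and the least significant bits of each element are stored first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimum number of bits needed to store the largest value (at least 1).
unsigned bits_needed(const std::vector<std::uint64_t>& values) {
    std::uint64_t max = 0;
    for (std::uint64_t v : values) if (v > max) max = v;
    unsigned bits = 1;
    while (bits < 64 && (max >> bits) != 0) ++bits;
    return bits;
}

// Packs the values consecutively, `bits` bits each, across 64-bit words.
std::vector<std::uint64_t> pack(const std::vector<std::uint64_t>& values, unsigned bits) {
    assert(bits >= 1 && bits <= 64);
    std::vector<std::uint64_t> words((values.size() * bits + 63) / 64, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(bits == 64 || (values[i] >> bits) == 0);  // value must fit in `bits`
        std::size_t bit = i * bits;
        std::size_t word = bit / 64;
        unsigned offset = static_cast<unsigned>(bit % 64);
        words[word] |= values[i] << offset;
        // The element straddles two words: store its upper part in the next one.
        if (offset + bits > 64) words[word + 1] |= values[i] >> (64 - offset);
    }
    return words;
}

// Extracts element i from the packed words.
std::uint64_t unpack(const std::vector<std::uint64_t>& words, unsigned bits, std::size_t i) {
    std::size_t bit = i * bits;
    std::size_t word = bit / 64;
    unsigned offset = static_cast<unsigned>(bit % 64);
    std::uint64_t mask = bits == 64 ? ~0ULL : ((1ULL << bits) - 1);
    std::uint64_t value = words[word] >> offset;
    if (offset + bits > 64) value |= words[word + 1] << (64 - offset);
    return value & mask;
}
```

With a width of 33 bits, 64 elements occupy 33 words instead of 64, which is where the space and bandwidth savings come from.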

Bit compression within adaptive data collections may decrease the dataset's memory space requirements, while increasing the number of values per second that can be loaded through a given bandwidth, according to some embodiments. Additionally, in some embodiments, bit compression may increase the CPU instruction footprint (e.g., since each processed element may need to be compressed when initialized and decompressed to a suitable format that the CPU can work with directly, such as 32 or 64 bits, when being accessed). However, this additional work (e.g., additional instructions required for bit compression and decompression compared to uncompressed elements) may be hidden (e.g., may not significantly affect overall performance) when iterating sequentially over a bit-compressed array that has a memory bandwidth bottleneck, potentially resulting in faster performance for the compressed array according to one embodiment.

In some embodiments, an adaptive data collection's implementation may be based on logically chunking the elements of a bit-compressed array into chunks of 64 numbers. This ensures that the beginning of the first and the end of the last number of the chunk are aligned to 64-bit words for all cases of bit compression from 1 bit to 64 bits. Thus, the same compression and decompression logic may be executed across chunks. While discussed above regarding chunks of 64 numbers, in some embodiments, other chunk sizes may be utilized depending on the exact nature and configuration of the machine and/or memory system being used.

Illustrated below is an example function (Function 1) including logic of an example “getter” (e.g., a method to obtain an element) of an adaptive data collection (e.g., an adaptive array) compressed with BITS number of bits. In the example function below, BITS may be a C++ class template parameter, so there may be 64 classes, allowing many of the arithmetic operations to be evaluated at compile time. Additionally, in some embodiments, BITS may indicate a number of bits supported directly by the CPU (e.g., 32, 64, etc.), in which case compression/decompression code may not be required, since the CPU may be able to work directly on the elements of the array. The example function below performs preparatory work to find the correct chunk index (line 1), the chunk's starting word in the array (lines 2-3), the corresponding chunk's starting bit and word (lines 4-5), the requested index's starting word in the array (line 6), and the mask to be used for extraction (line 7). If the requested element lies wholly within a 64-bit word (line 8), it is extracted with a shift and a mask (line 9). If the element lies between two words (line 10), its two parts are extracted and are combined to return the element (line 11). The example function assumes little-endian encoding; however, any suitable encoding may be used in different embodiments.

Function 1 - BitCompressedArray::get(index, replica)
 1: chunk ← index / 64
 2: wordsPerChunk ← BITS
 3: chunkStart ← chunk * wordsPerChunk
 4: bitInChunk ← (index % 64) * BITS
 5: bitInWord ← bitInChunk % 64
 6: word ← chunkStart + (bitInChunk / 64)
 7: mask ← (1 << BITS) - 1
 8: if bitInWord + BITS <= 64 then
 9:   return (replica[word] >> bitInWord) & mask
10: else
11:   return ((replica[word] >> bitInWord) | (replica[word+1] << (64-bitInWord))) & mask
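The getter logic of Function 1 may be written in C++ roughly as follows. This is a sketch under stated assumptions rather than the actual implementation: it is a free function template named bitCompressedGet (a hypothetical name), it covers only widths of 1 to 63 bits (the 64-bit case would be specialized as described elsewhere herein), and it assumes the replica is a plain array of 64-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of Function 1 with BITS as a template parameter, so that the
// constants below can be folded at compile time.
template <unsigned BITS>
std::uint64_t bitCompressedGet(const std::uint64_t* replica, std::size_t index) {
    static_assert(BITS >= 1 && BITS < 64, "sketch covers shift-and-mask widths only");
    const std::size_t chunk = index / 64;
    const std::size_t wordsPerChunk = BITS;  // 64 elements * BITS bits / 64 bits per word
    const std::size_t chunkStart = chunk * wordsPerChunk;
    const std::size_t bitInChunk = (index % 64) * BITS;
    const unsigned bitInWord = static_cast<unsigned>(bitInChunk % 64);
    const std::size_t word = chunkStart + bitInChunk / 64;
    const std::uint64_t mask = (std::uint64_t{1} << BITS) - 1;
    if (bitInWord + BITS <= 64) {
        // Element lies wholly within one word.
        return (replica[word] >> bitInWord) & mask;
    }
    // Element straddles two words: low part from this word, high part from the next.
    // Here bitInWord >= 1, so the left shift amount is at most 63.
    return ((replica[word] >> bitInWord) | (replica[word + 1] << (64 - bitInWord))) & mask;
}
```

With 33-bit elements, for example, element 0 occupies bits 0-32 of the first word and element 1 occupies the remaining 31 bits of the first word plus the low 2 bits of the second word.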

Illustrated below is an example function (Function 2) illustrating initialization logic of an adaptive data collection (e.g., an adaptive array) compressed with BITS number of bits, according to one example embodiment. For instance, after performing the same preparatory work as the getter (described above regarding example Function 1), the example init function below calculates whether the element needs to be split across two words (line 2). The init function may then initialize the element for each replica if the array is replicated (line 3). If the element wholly fits in the first word, its value is set (line 4). If it spills over to the next word (line 5), its second part is set in the next word (line 6). While not illustrated in the example function below, in some embodiments a thread-safe variant of the function may be implemented using atomic compare-and-swap instructions or using locks, such as having one lock per chunk. In cases of concurrent read and write accesses, the user of adaptive data collections may need to synchronize the accesses.

Function 2 - BitCompressedArray::init(index, value)
1: /* ... same as lines 1-7 of Function 1 ... */
2: word2 ← chunkStart + ((bitInChunk + BITS - 1) / 64)
3: for replica = 0 to replicas do
4:   data[replica][word] = (data[replica][word] & ~(mask << bitInWord)) | (value << bitInWord)
5:   if word != word2 then
6:     data[replica][word2] = (data[replica][word2] & ~(mask >> (64-bitInWord))) | (value >> (64-bitInWord))
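The initialization logic may be sketched in C++ as shown below. The struct name PackedReplicas and its layout (one word vector per replica) are assumptions made for illustration. The sketch also makes two choices of its own that should be read as assumptions: the second word is updated by reading and writing that same word, and it is touched only when the element actually crosses a word boundary. The sketch is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one packed word vector per replica (e.g., per socket).
template <unsigned BITS>
struct PackedReplicas {
    static_assert(BITS >= 1 && BITS < 64, "sketch covers shift-and-mask widths only");
    std::vector<std::vector<std::uint64_t>> data;

    PackedReplicas(std::size_t replicas, std::size_t chunks)
        : data(replicas, std::vector<std::uint64_t>(chunks * BITS, 0)) {}

    void init(std::size_t index, std::uint64_t value) {
        // Same preparatory work as the getter.
        const std::size_t chunkStart = (index / 64) * BITS;
        const std::size_t bitInChunk = (index % 64) * BITS;
        const unsigned bitInWord = static_cast<unsigned>(bitInChunk % 64);
        const std::size_t word = chunkStart + bitInChunk / 64;
        const std::uint64_t mask = (std::uint64_t{1} << BITS) - 1;
        value &= mask;  // discard bits that do not fit in BITS
        const bool spills = bitInWord + BITS > 64;  // element crosses into word + 1
        for (auto& replica : data) {
            assert(word < replica.size());
            // Clear the element's bits in the first word, then set the low part.
            replica[word] = (replica[word] & ~(mask << bitInWord)) | (value << bitInWord);
            if (spills) {
                const unsigned shift = 64 - bitInWord;  // 1..63 whenever spills is true
                // Clear and set the high part at the bottom of the next word.
                replica[word + 1] = (replica[word + 1] & ~(mask >> shift)) | (value >> shift);
            }
        }
    }
};
```

Because every chunk ends on a word boundary, an element never spills past the last word of its own chunk in this layout.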

Additionally, in order to optimize scans of an adaptive data collection, such as an adaptive array, which may be significant operations in analytics workloads, an adaptive data collection may support a function that can unpack a whole chunk of a bit-compressed collection. Illustrated below is an example function (Function 3) that shows unpack logic configured to condense consecutive getter operations for a complete chunk of a replica and output the 64 numbers of the chunk to a given output buffer, according to one example embodiment. After performing similar preparatory work (lines 1-4) as in Function 1 above, the function starts iterating over the chunk's elements (line 5). For every element, the function determines whether it is wholly within the current word (line 6). If it is, it is output (line 7) and the function continues to the next element (line 8). If the current element also finishes the current word (line 9), it is output (line 10), the bit index is reset for the next word (line 11), and the function continues to the next word (lines 12-13). If the current element crosses over to the next word (line 14), the element is made up from its two parts across the words and is output (lines 15-17), before continuing on to the next element (lines 18-20). The main loop of the function may be unrolled manually or automatically (by the compiler) according to various embodiments, such as to avoid the branches and permit compile-time derivation of the constants used.

Function 3 - BitCompressedArray::unpack(chunk, replica, out)
 1: chunkStart ← chunk * wordsPerChunk
 2: word ← chunkStart
 3: value ← replica[word]
 4: bitInWord ← 0
 5: for i = 0 to 64 do
 6:   if bitInWord + BITS < 64 then
 7:     out[i] = (value >> bitInWord) & mask
 8:     bitInWord += BITS
 9:   else if bitInWord + BITS == 64 then
10:     out[i] = (value >> bitInWord) & mask
11:     bitInWord = 0
12:     word++
13:     value = replica[word]
14:   else
15:     nextWord = word + 1
16:     nextWordValue = replica[nextWord]
17:     out[i] = mask & ((value >> bitInWord) | (nextWordValue << (64-bitInWord)))
18:     bitInWord = (bitInWord + BITS) - 64
19:     word = nextWord
20:     value = nextWordValue
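A C++ rendering of the unpack logic might look like the following sketch. The free function name unpackChunk is hypothetical, and the sketch covers widths of 1 to 63 bits only. One detail is an assumption added for the example: after the last element of the chunk, the next word is not loaded, so that unpacking the final chunk does not read past the end of the array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decode the 64 elements of one chunk into out[0..63].
template <unsigned BITS>
void unpackChunk(const std::uint64_t* replica, std::size_t chunk, std::uint64_t* out) {
    static_assert(BITS >= 1 && BITS < 64, "sketch covers shift-and-mask widths only");
    const std::uint64_t mask = (std::uint64_t{1} << BITS) - 1;
    std::size_t word = chunk * BITS;  // a chunk of 64 elements spans BITS words
    std::uint64_t value = replica[word];
    unsigned bitInWord = 0;
    for (unsigned i = 0; i < 64; ++i) {
        if (bitInWord + BITS < 64) {
            // Element is inside the current word, with bits left over.
            out[i] = (value >> bitInWord) & mask;
            bitInWord += BITS;
        } else if (bitInWord + BITS == 64) {
            // Element ends exactly at the end of the current word.
            out[i] = (value >> bitInWord) & mask;
            bitInWord = 0;
            ++word;
            if (i + 1 < 64) value = replica[word];  // added guard: skip load after last element
        } else {
            // Element straddles the current word and the next one.
            const std::uint64_t next = replica[word + 1];
            out[i] = mask & ((value >> bitInWord) | (next << (64 - bitInWord)));
            bitInWord = bitInWord + BITS - 64;
            ++word;
            value = next;
        }
    }
}
```

Since BITS is a compile-time constant, a compiler that fully unrolls the loop can resolve each of the three branches statically, which is the effect the unrolling described above aims for.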

FIG. 7 is a logical diagram illustrating an example software model for implementing adaptive data collections, according to one embodiment. Specifically, FIG. 7 shows a unified modeling language (UML) diagram including classes of an example adaptive array (e.g., SmartArray) and their associated APIs, such as the iterator API. The example SmartArray class in FIG. 7 is an abstract class holding the basic properties that signify whether the SmartArray is replicated, interleaved or pinned to a single socket, and the number of bits with which it is bit compressed, as illustrated by variables 702. If the SmartArray is replicated (i.e., if the SmartArray replicates its internal state), the replicas array 704 may hold a pointer per socket that points to the replicated data allocated on the corresponding socket. As described above, from a user/programmer's point of view there may appear to be only a single adaptive data collection with replication of the data structures hidden inside the adaptive data collection. If replication is not enabled, there may be a single replica in the replicas array. The allocate( ) static function 706 may create a new SmartArray using the concrete sub-classes depending on the bit compression, and may allocate the replica(s) considering the given data placement parameters. The getReplica( ) function 708 may be configured to return the replica corresponding to the socket of the calling thread. The remaining functions correspond to the pseudo code shown in example Functions 1-3 above.

The concrete sub-classes 710, 720 and 730 of SmartArray 700 may correspond to all cases of bit compression with a number of bits 1-64, according to the example embodiment. The cases of bit compression with 32 and 64 bits (e.g., sub-classes 720 and 730) are specialized in the example embodiment since they directly map to native integers as defined on the system of the example embodiment. Consequently, BitCompressedArray<32> and BitCompressedArray<64> may be implemented with simplified getter, initialization, and unpack functions that do not require shifting and masking, according to some embodiments.

In addition to a random access API of the SmartArray class, a forward iterator for efficient scans may be implemented, as illustrated by SmartArrayIterator 740 in FIG. 7. The forward iterator 740 may make it possible to hide replica selection as well as the unpacking of the compressed elements, according to some embodiments. In the example embodiment of FIG. 7, SmartArrayIterator is an abstract class holding pointers 742, 744 and 746 to the referenced SmartArray, the target replica, and the current index of the iterator, respectively. A new iterator may be created by calling the allocate( ) static function 748. According to the example embodiment of FIG. 7, the allocate( ) function 748 sets the target replica by calling the given SmartArray's getReplica( ) function 708 to get the replica that corresponds to the socket of the calling thread, and finally constructs and returns one of the concrete sub-classes depending on the bit compression of the underlying SmartArray 700. In C++ the iterator may be allocated slightly differently when compiled into LLVM bitcode for use from the runtime cross-language compiler 130. For instance, the iterator may be allocated transparently in the runtime cross-language compiler's heap, such as to give the compiler the chance to additionally optimize the allocation when compiling the user's code (e.g., code that uses the iterator API). The reset( ) function 750 may reset the current index to what is given as the argument, the next( ) function 752 may move to the next index, while the get( ) function 754 may get the element corresponding to the current index, according to the example embodiment.

The example SmartArrayIterator 740 has three concrete sub-classes 760, 770 and 780. Two (e.g., Uncompressed32Iterator 760 and Uncompressed64Iterator 770) correspond to the uncompressed cases with 32 and 64 bits per element, respectively, for which specialized versions using 32-bit and 64-bit integers directly may be used. The third (e.g., CompressedIterator 780) corresponds to all other cases of bit compression. The CompressedIterator 780 holds a buffer 782 for unpacking elements. When the next( ) function moves to the next chunk, it may call the SmartArray's unpack( ) function to fetch the next 64 elements into the buffer, while the get( ) function may return the element from the buffer corresponding to the current index, according to the example embodiment.

While described above regarding arrays, the iterator API may also be utilized with other adaptive data collections. In the case of bit compression, the iterator API may have to test whether a new chunk needs unpacking. This may generate a large number of branch stalls, which may not be evaluated speculatively and may increase CPU load. A different unified API for languages that support user-defined lambdas may be used in some embodiments. For example, in one embodiment the unified API may provide a bounded map( ) interface accepting a lambda and a range to apply it over. In comparison to the iterator API, the map interface may further improve performance as it may not stall on the branches because it is able to remove many of them, and to speculatively execute the lambda in the remaining cases, according to various embodiments.
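One way a bounded map( ) interface could be shaped is sketched below in C++. The class name PackedArray, its constructor, and the lambda signature f(index, value) are assumptions made for this example, and the sketch covers widths of 1 to 63 bits only. The point it illustrates is that the chunk boundary test is performed once per 64 elements in the outer loop, so the inner loop that invokes the lambda carries no per-element unpacking branch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packed array exposing a bounded map() instead of an iterator.
template <unsigned BITS>
class PackedArray {
    static_assert(BITS >= 1 && BITS < 64, "sketch covers shift-and-mask widths only");
    static constexpr std::uint64_t kMask = (std::uint64_t{1} << BITS) - 1;
    std::vector<std::uint64_t> words_;
    std::size_t length_;

    // Decode one chunk (64 elements, BITS words) into out[0..63].
    void unpack(std::size_t chunk, std::uint64_t* out) const {
        const std::uint64_t* w = &words_[chunk * BITS];
        for (unsigned i = 0; i < 64; ++i) {
            const unsigned bit = i * BITS;
            const unsigned off = bit % 64;
            std::uint64_t v = w[bit / 64] >> off;
            if (off + BITS > 64) v |= w[bit / 64 + 1] << (64 - off);
            out[i] = v & kMask;
        }
    }

public:
    // Packs values consecutively; storage is rounded up to whole chunks.
    explicit PackedArray(const std::vector<std::uint64_t>& values)
        : words_(((values.size() + 63) / 64) * BITS, 0), length_(values.size()) {
        for (std::size_t i = 0; i < values.size(); ++i) {
            const std::size_t bit = i * BITS;  // chunks are contiguous, so this is global
            const unsigned off = static_cast<unsigned>(bit % 64);
            const std::uint64_t v = values[i] & kMask;
            words_[bit / 64] |= v << off;
            if (off + BITS > 64) words_[bit / 64 + 1] |= v >> (64 - off);
        }
    }

    // Apply f(index, value) to every element in [begin, end), clamped to the length.
    template <typename F>
    void map(std::size_t begin, std::size_t end, F&& f) const {
        end = std::min(end, length_);
        std::uint64_t buffer[64];
        for (std::size_t chunk = begin / 64; chunk * 64 < end; ++chunk) {
            unpack(chunk, buffer);  // one unpack per chunk, outside the inner loop
            const std::size_t lo = std::max(begin, chunk * 64);
            const std::size_t hi = std::min(end, (chunk + 1) * 64);
            for (std::size_t i = lo; i < hi; ++i) f(i, buffer[i - chunk * 64]);
        }
    }
};
```

A caller might aggregate a range with a call such as a.map(60, 130, [&](std::size_t, std::uint64_t x) { sum += x; }), leaving the compiler free to inline the lambda into the inner loop.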

While described above in terms of specific class, method, variable and function names, an adaptive data collection may be implemented using differing numbers of classes, methods, variables, functions, etc., which may be named differently than those described herein.

As noted previously, a thin API may be provided, such as to hide the runtime cross-language compiler's API calls to the entry points of a unified API. FIG. 4, discussed above, shows a simplified example of an example Java wrapper class for an example adaptive data collection (e.g., SmartArray). The wrapper class stores the pointer to the native object of the SmartArray. The native pointer is given to the entry point functions. Entry points and wrapper classes are provided only for the two abstract classes (e.g., SmartArray and SmartArrayIterator) of the unified API in the example embodiment illustrated by FIG. 4, discussed above.

However, entry points and wrapper functions may have an additional version where the user (e.g., code that accesses the adaptive data collection) may pass the number of bits with which the SmartArray is to be bit-compressed. Depending on the number of bits, the entry point branches off and redirects to the function of the correct sub-class, thus avoiding the overhead of a virtual dispatch and dispensing with the need to provide separate entry points to the sub-classes, according to some embodiments. Moreover, in some embodiments, the runtime cross-language compiler may avoid the branching in the entry points by profiling the number of bits during the interpreted runs and considering it as fixed during optimization and when applying just-in-time compilation.
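The branching entry point may be sketched in C++ as follows. Everything named here is hypothetical: Getter stands in for the concrete sub-classes, smartArrayGetWithBits stands in for an entry point that receives the number of bits, and returning 0 for an unsupported width is a placeholder for real error handling. The sketch specializes only the 64-bit case; the 32-bit case is handled by the generic shift-and-mask path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for the concrete sub-classes: one instantiation per bit width.
template <unsigned BITS>
struct Getter {
    static std::uint64_t get(const std::uint64_t* words, std::size_t index) {
        const std::size_t bit = index * BITS;  // chunks are contiguous, so this is global
        const unsigned off = static_cast<unsigned>(bit % 64);
        std::uint64_t v = words[bit / 64] >> off;
        if (off + BITS > 64) v |= words[bit / 64 + 1] << (64 - off);
        return v & ((std::uint64_t{1} << BITS) - 1);
    }
};

// Native 64-bit elements need no shifting or masking.
template <>
struct Getter<64> {
    static std::uint64_t get(const std::uint64_t* words, std::size_t index) {
        return words[index];
    }
};

// Compare the runtime width against each compile-time width in turn and
// call the matching instantiation directly (no virtual dispatch).
template <unsigned BITS>
inline std::uint64_t dispatchGet(const std::uint64_t* w, std::size_t i, unsigned bits) {
    return bits == BITS ? Getter<BITS>::get(w, i) : dispatchGet<BITS + 1>(w, i, bits);
}

// End of the chain: width outside 1..64.
template <>
inline std::uint64_t dispatchGet<65>(const std::uint64_t*, std::size_t, unsigned) {
    return 0;  // ASSUMPTION: placeholder; a real entry point would report an error
}

// Hypothetical entry point taking the number of bits from the caller.
inline std::uint64_t smartArrayGetWithBits(const std::uint64_t* words,
                                           std::size_t index, unsigned bits) {
    return dispatchGet<1>(words, index, bits);
}
```

If a just-in-time compiler treats bits as a constant, as described above, the chain of comparisons in such an entry point can collapse to the single matching call.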

Illustrated below is an example function (Function 4) showing one example of what the final experience may look like from a programmer's view using a simple example of an aggregation of an adaptive array in C++ and Java. The example below uses an iterator since the aggregation scans the adaptive array.

Function 4 - aggregate( ) example in both C++ and Java
// C++
it = SmartArrayIterator::allocate(smartArray, 0);
for (long i = 0; i < smartArray.getLength(); i++) {
  sum += it->get();
  it->next();
}
// Java
it = new SmartArrayIterator(smartArray, 0);
long bits = GraalVM.profile(smartArray.getBits());
for (long i = 0; i < smartArray.getLength(); i++) {
  sum += it.get(bits);
  it.next(bits);
}

The C++ example above uses the abstract SmartArrayIterator class 740, but can immediately use a concrete sub-class depending on the number of bits with which the SmartArray is bit-compressed in order to avoid any virtual dispatch overhead.

The example Java function is very similar to the example C++ function. It is executed with the runtime cross-language compiler 130. The versions of the thin API's functions that receive the number of bits are used. Additionally, the runtime cross-language compiler's API functionalities are used to “profile” the number of bits, such as to ensure that the compiler considers the number of bits fixed during compilation, as well as to incorporate the final code of the get( ) and next( ) functions of the concrete sub-class, thereby avoiding any virtual dispatch or branching overhead. For example, if the SmartArray is bit-compressed with 33 bits, the next( ) function may unpack every 64 elements immediately with the code of the BitCompressedArray<33>::unpack( ) function, whereas if the SmartArray is uncompressed with 64 bits, then the get( ) and next( ) functions may be so simple that compiled code simply increases a pointer at every iteration of the loop without needing to allocate anything for the iterator, according to some embodiments.

The cost, benefit, and availability of the optimizations can vary depending on the machine, the algorithm, and the input data in various embodiments. Table 1 describes the trade-offs, according to one embodiment.

TABLE 1 Trade-offs of Various Example Data Functionalities.

Technique | Advantages | Disadvantages
Bit compression | Smaller memory footprint. Less memory bandwidth. | Extra CPU load per access.
Replication | Less interconnect traffic. Spreads load evenly across all memory channels. | More memory footprint. Time initializing replicas. Only for read-only data.
Interleaved | Effective use of bidirectional interconnect. Load on memory approximately equal across banks. | May leave memory bandwidth unused as threads stall on interconnect transfers.
Single Socket | Increase in speed on the local socket can outweigh the loss of performance elsewhere. | Only advantageous if the memory bandwidth is much higher than the interconnect bandwidth.

An adaptivity mechanism 450 utilized with adaptive data collections may, in some embodiments, enable a more dynamic adaptation between alternative implementations at runtime, such as by considering the changes in the system load as other workloads start and finish, or the changes in utilization of main memory. Additionally, an adaptivity mechanism may in some embodiments re-apply its adaptivity workflow to select a potentially new set of adaptive data functionalities and data layouts for multiple adaptive data collections. This process may consider the concurrent workloads of all supported languages on each smart collection.

As described herein according to one example embodiment, the system may perform configuration selection to select a placement candidate for uncompressed data placement and, if possible, a placement candidate for compressed data placement. Then, analytics may be used to determine which configuration, including which placement candidates, to use. As noted above, a configuration may specify one or more data functionalities for the data collection. After determining which configuration to use, the data collection may be configured according to the determined configuration. For instance, if a selected configuration specifies a NUMA-aware data placement scheme (e.g., OS default, single socket, replication, interleaved, etc.), the data collection may be configured according to the given data placement scheme. Similarly, if the selected configuration specifies a compression algorithm, the data collection may be configured to use that compression algorithm when storing and retrieving data of the collection.

FIG. 8 is a flowchart illustrating one embodiment of a method for placement configuration selection, as described herein. As illustrated in block 810, the system may be configured to collect initial workload information according to one embodiment. In some embodiments, different configurations' resource needs may be predicted based on a small number of workload measurements and a particular configuration (e.g., a particular adaptable data collection configuration) may be selected for each scenario, while the decisions about which configurations to use may be made manually (e.g., by the programmer) or automatically (e.g., at runtime by the system), according to various embodiments.

Thus, a configuration may be selected based on one or more predicted resource requirements for the workload to be executed. For example, in one embodiment the configuration selection may be based on various inputs, referred to herein as initial workload information. For example, in one embodiment, the configuration selection may be based on three inputs, including, according to one example: 1) a specification of the machine containing the size of the system memory, the maximum bandwidth between components and the maximum compute capability available on each core; 2) a specification of performance characteristics of the data collections, such as the costs of accessing a compressed data item, which may be derived from performance counters and may be specific to the data collection and/or machine, but may not be specific to a given workload; and 3) information collected from hardware performance counters describing the memory, bandwidth, and processor utilization of the workload. Please note that the specific type of input used to select a configuration may vary from those described above and from embodiment to embodiment.

The configuration used when collecting the initial workload information may vary from embodiment to embodiment. For instance, in one embodiment an uncompressed interleaved placement may be used with an equal number of threads on each core. Interleaving may provide symmetry in execution and, as the interconnect links on many processors may be independent in each direction, the bandwidth available to perform the restructuring of the memory may be effectively doubled, thereby potentially reducing the time to change data placement if restructuring on the fly is implemented, according to one embodiment.

In some embodiments, information from hardware performance counters may be collected from one or more profiling runs (e.g., executions) of the same workload. In some embodiments, the profiling runs may be previous iterations of an iterative workload (e.g., PageRank iterating to convergence). Alternatively, in another embodiment, one could collect workload information from early batches of a loop over the data collection, and restructure the array on the fly.

In some embodiments, the system may be configured to select an uncompressed configuration placement candidate, as in block 820. FIG. 9 is a flowchart illustrating one method for candidate selection with uncompressed placement of an adaptive data collection, according to one example embodiment. In some embodiments, choosing a placement for compression may require some of the tests to be moved forward in order to determine if compression is possible before considering which data placement to use. For example, every access may require a number of words to be loaded, making random accesses more expensive for compressed data than uncompressed data. Decisions illustrated in FIG. 9 are split into two categories: a) software characteristics based on information provided by the programmer, compiler, etc., such as numbers of iterations and whether the accesses are read-only, and b) runtime characteristics, denoted in grey, based on measurements of the workload, according to one embodiment.

As illustrated by decision block 900, the system may first determine whether the workload is memory-bound. If the workload is not memory-bound, as illustrated by the negative output of decision block 900, the system may then select an interleaved configuration as a candidate, as in block 990. If, however, the workload is memory-bound, as indicated by the positive output of decision block 900, the system may then determine whether there is space sufficient for uncompressed replication, as in decision block 910. For example, replicating data collections, or single socket allocation, requires that enough memory be available on each socket. There may be different versions of this test for compressed and uncompressed data as compression can make replication possible where uncompressed data would not fit otherwise.

If, as indicated by the negative output of decision block 910, there is not enough space for uncompressed replication, the system may select an interleaved configuration as a candidate, as in block 990. If, however, there is enough space for uncompressed replication, as indicated by the positive output of decision block 910, the system may then determine whether the data collection is read only, as in decision block 920. If the data collection is read only, as indicated by the positive output of decision block 920, the system may then determine whether the workload includes significant random accesses, as in decision block 930. For instance, if a workload contains many random accesses, then the additional latency cost may affect the point at which replication is worthwhile. Thus, the system may be configured to analyze and compare the number of random accesses of the data collection against a threshold that may be predetermined and/or configurable according to various embodiments. The determination of whether there are significant random accesses, and/or the threshold used for such a determination, may be (or may be considered) a machine-specific bound, in some embodiments.

If the workload includes significant random accesses, as indicated by the positive output of decision block 930, it may then be determined whether the workload includes multiple random accesses per element, as in decision block 940. For example, there may be a time cost to initialize replicated data and sufficient accesses may be required to amortize this cost. The bounds (e.g., the thresholds) for this may be machine-specific and may vary depending on whether the accesses are random or linear, according to some embodiments. Thus, the system may be configured to determine whether there are multiple random accesses per element. If so, as indicated by the positive output of decision block 940, the system may select a replicated data configuration as a candidate, as in block 970.

Returning to decision block 920, if the workload is not read only, as indicated by the negative output of decision block 920, the system may determine whether the total local speedup is greater than the total remote slowdown, as in decision block 960. For example, for some workloads on some architectures, it may be better to keep all data on a single socket. In some embodiments, this strategy may work when the ratio between remote and local access bandwidth is very high. In some cases, the speedup for some threads performing only local accesses may outweigh the slowdown of the threads performing remote accesses. Thus, in some embodiments, the system may be configured to compare the total local access bandwidth to the total remote bandwidth to determine whether the total local speedup is greater than the total remote slowdown.

To determine whether the speedup for some threads performing only local accesses outweighs the slowdown of the threads performing remote accesses, as in decision block 960, the system may be configured to perform one or more of the following calculations. The example calculations below are for a two-socket machine; however, in other embodiments machines with differing numbers of sockets may be used with similar, but suitably modified, calculations.

First, the system may calculate how quickly a socket could compute if relieved of any memory limitations. In some embodiments, the notion of execution rate (exec) may be used to represent the instructions executed per time unit. Additionally, frequency scaling may make instructions per cycle (IPC) an inappropriate metric in some embodiments. Thus:

$$\text{improvement}_{\text{exec}} = \frac{\text{exec}_{\text{max}}}{\text{exec}_{\text{current}}}$$

Second, the system may be configured to use the “used” and “available” bandwidth (bw) both between sockets and to main memory in order to calculate how fast the local socket could compute with all local accesses assuming that the remote socket is saturating the interconnect link, according to one embodiment. To account for bandwidth lost due to latency, the bandwidth values taken from the machine description may be scaled to the maximum bandwidth used by the workload during measurement. For example, if a 90% utilization of the link that is a bottleneck is achieved (e.g., measured), the maximum performance of all links may be scaled to 90% to reflect the maximum possible utilization. Thus:

$$\text{improvement}_{\text{bw}} = \frac{\text{bw}_{\text{max memory}} - \text{bw}_{\text{max interconnect}}}{\text{bw}_{\text{current memory}}}$$

The minimum of these two improvements may be taken as the maximum speedup of the local socket: $\text{speedup}_{\text{local}}$. Finally, the maximum speedup of the remote socket with all remote accesses may be calculated. This value may be expected to be less than 1, indicating a slowdown:

$$\text{speedup}_{\text{remote}} = \frac{\text{bw}_{\text{max interconnect}}}{\text{bw}_{\text{current memory}}}$$

If the average of the local and remote speedup is greater than 1, then having the data on a single socket may be beneficial, according to some embodiments.
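The two-socket calculation above may be sketched in C++ as follows. The struct SocketMeasurements, its field names, and the single utilizationScale factor (e.g., 0.9 for the 90% example above) are assumptions introduced for illustration; units are arbitrary as long as they are consistent.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical record of machine limits and measured values.
struct SocketMeasurements {
    double execMax;            // maximum execution rate of a socket
    double execCurrent;        // measured execution rate
    double bwMaxMemory;        // maximum memory bandwidth (machine description)
    double bwMaxInterconnect;  // maximum interconnect bandwidth (machine description)
    double bwCurrentMemory;    // measured memory bandwidth
    double utilizationScale;   // ASSUMPTION: one factor applied to all link maxima
};

// Maximum speedup of the local socket: the smaller of the two improvements.
inline double localSpeedup(const SocketMeasurements& m) {
    const double improvementExec = m.execMax / m.execCurrent;
    const double maxMemory = m.bwMaxMemory * m.utilizationScale;
    const double maxInterconnect = m.bwMaxInterconnect * m.utilizationScale;
    const double improvementBw = (maxMemory - maxInterconnect) / m.bwCurrentMemory;
    return std::min(improvementExec, improvementBw);
}

// Maximum speedup of the remote socket; typically below 1 (a slowdown).
inline double remoteSpeedup(const SocketMeasurements& m) {
    return (m.bwMaxInterconnect * m.utilizationScale) / m.bwCurrentMemory;
}

// Single-socket placement may pay off when the average speedup exceeds 1.
inline bool singleSocketBeneficial(const SocketMeasurements& m) {
    return (localSpeedup(m) + remoteSpeedup(m)) / 2.0 > 1.0;
}
```

As a worked instance with illustrative numbers only: with exec improving from 2 to 4, memory bandwidth of 120, interconnect bandwidth of 20 and a measured 50, the local speedup is min(2, 100/50) = 2, the remote speedup is 20/50 = 0.4, and the average of 1.2 exceeds 1.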

Thus, if the system determines, such as by using the above calculations, that the local speedup is greater than the remote slowdown, as indicated by the positive output of decision block 960, the system may select a single socket configuration as a candidate, as in block 980. If, however, the local speedup is not greater than the remote slowdown, as indicated by the negative output of decision block 960, the system may select an interleaved configuration as a candidate, as in block 990.

Returning to decision block 930, if, as indicated by the negative output, it is determined that there are no significant random accesses, the system may be configured to determine whether the workload includes (e.g., performs) multiple linear accesses per element, as in block 950. As with determining whether the workload includes (e.g., performs) multiple random accesses per element, discussed above, there may be a time cost to initialize replicated data and sufficient accesses may be required to amortize this cost. The bounds (e.g., the thresholds) for this may be machine-specific and may vary depending on whether the accesses are random or linear, according to some embodiments. If the system determines that there are multiple linear accesses per element, as indicated by the positive output of decision block 950, the system may select a replicated data configuration as a candidate, as in block 970. Alternatively, if it is determined that there are not multiple linear accesses per element, as indicated by the negative output of decision block 950, processing may proceed to decision block 960, discussed above.
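The decision sequence described for FIG. 9 may be summarized by the following C++ sketch. The type and field names are hypothetical, and each boolean stands for the outcome of a test that would be made against machine-specific thresholds. One branch is an assumption rather than something stated above: the negative output of decision block 940 is treated here like the negative output of block 950, continuing to decision block 960.

```cpp
#include <cassert>

enum class Placement { Interleaved, Replicated, SingleSocket };

// Hypothetical summary of the tests made at each decision block.
struct WorkloadInfo {
    bool memoryBound;                        // block 900
    bool fitsUncompressedReplication;        // block 910
    bool readOnly;                           // block 920
    bool significantRandomAccesses;          // block 930
    bool multipleRandomAccessesPerElement;   // block 940
    bool multipleLinearAccessesPerElement;   // block 950
    bool localSpeedupExceedsRemoteSlowdown;  // block 960
};

inline Placement selectUncompressedCandidate(const WorkloadInfo& w) {
    if (!w.memoryBound) return Placement::Interleaved;                  // 900 -> 990
    if (!w.fitsUncompressedReplication) return Placement::Interleaved;  // 910 -> 990
    if (w.readOnly) {                                                   // 920
        const bool enoughAccesses = w.significantRandomAccesses         // 930
            ? w.multipleRandomAccessesPerElement                        // 940
            : w.multipleLinearAccessesPerElement;                       // 950
        if (enoughAccesses) return Placement::Replicated;               // 970
        // ASSUMPTION: a negative output of block 940 continues to block 960,
        // mirroring the stated behavior of block 950.
    }
    return w.localSpeedupExceedsRemoteSlowdown ? Placement::SingleSocket  // 980
                                               : Placement::Interleaved;  // 990
}
```

For example, a memory-bound, read-only workload with sufficient space and multiple linear accesses per element would yield the replicated candidate under this sketch.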

While the method illustrated by the flowchart in FIG. 9 is illustrated and described using particular steps in a particular order, those steps are only for ease of description. However, in various embodiments, the functionality of the method of FIG. 9 may be performed in a different order, some steps may be combined, steps may be removed and/or additional ones included.

Returning now to FIG. 8, the system may be configured to select a compressed configuration candidate, as in block 830. As discussed above, in some embodiments, the system may be configured to select both a candidate configuration with compression and a candidate configuration without compression. FIG. 10 is a flowchart illustrating one method for candidate selection with compressed placement of an adaptive data collection, according to one example embodiment. Decisions illustrated in FIG. 10 are split into two categories: a) software characteristics based on information provided by the programmer, compiler, etc., such as numbers of iterations and whether the accesses are read-only, and b) runtime characteristics, denoted in grey, based on measurements of the workload, according to one embodiment.

As in decision block 1000, the system may determine whether the workload is memory bound. If it is determined that the workload is memory bound, as indicated by the positive output of decision block 1000, the system may then determine whether the workload includes mostly reads, as in decision block 1005. When determining whether the workload includes mostly reads, the system may compare the percentage of accesses that are reads to a threshold (whether predetermined or configurable). If, as indicated by the positive output of decision block 1005, the workload is determined to include mostly reads, the system may then determine whether there are a significant number of random accesses, as in decision block 1010.

For instance, if a workload includes many random accesses, then the additional latency cost may affect the point at which replication is worthwhile. Thus, the system may be configured to analyze and compare the number of random accesses of the data collection against a threshold that may be predetermined and/or configurable according to various embodiments. The determination of whether there are significant random accesses, and/or the threshold used for such a determination, may be (or may be considered) a machine-specific bound, in some embodiments. If the workload includes significant random accesses, as indicated by the positive output of decision block 1010, the system may then determine not to use compression, as in block 1040. Similarly, if the system determines that the workload is not memory bound, as indicated by the negative output of decision block 1000, or if the system determines that the workload does not include mostly reads, as indicated by the negative output of decision block 1005, the system may determine not to use compression, as in block 1040.

If it is determined that the workload does not include significant random accesses, as indicated by the negative output of decision block 1010, the system may then determine whether there is sufficient space for compressed replication, as in decision block 1015. For example, replicating data collections, or single socket allocation, requires that enough memory be available on each socket. There may be different versions of this test for compressed and uncompressed data, as compression can make replication possible where uncompressed data would not fit otherwise. If it is determined that there is space for compressed replication, as indicated by the positive output of decision block 1015, the system may then determine whether the data collection is read only, as in decision block 1020. If the data collection is read only, as indicated by the positive output of decision block 1020, the system may then determine whether the workload includes multiple linear accesses per element, as in decision block 1025. As described above regarding block 950 of FIG. 9, there may be a time cost to initialize replicated data, and sufficient accesses may be required to amortize this cost. The bounds and/or thresholds used to determine whether enough linear accesses per element exist may, in some embodiments, be machine-specific.

If the workload includes multiple linear accesses per element, as indicated by the positive output of decision block 1025, the system may select a replicated data configuration with compression as a candidate, as in block 1050. Alternatively, if it is determined that the workload does not include multiple linear accesses per element, as indicated by the negative output of decision block 1025, the system may be configured to determine whether the total local speedup is greater than the total remote slowdown, as in decision block 1030. The system may determine whether the total local speedup is greater than the total remote slowdown in the same manner as that described above regarding block 960 of FIG. 9, except using computed values for execution rate (i.e., $exec_{compressed}$) and bandwidth (i.e., $bw_{compressed}$), discussed below. If it is determined that the local speedup is greater than the remote slowdown, as indicated by the positive output of decision block 1030, the system may then determine to select a single socket configuration with compression as a candidate, as in block 1060. If, however, it is determined that the local speedup is not greater than the remote slowdown, as indicated by the negative output of decision block 1030, the system may determine to select an interleaved configuration with compression as a candidate, as in block 1070.

While the method illustrated by the flowchart in FIG. 10 is illustrated and described using particular steps in a particular order, those steps are only for ease of description. In various embodiments, the functionality of the method of FIG. 10 may be performed in a different order, some steps may be combined, steps may be removed and/or additional ones may be included.
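The decision flow of FIG. 10 described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `Profile` fields, the threshold defaults, and the function name are hypothetical, and the text does not say where the negative outputs of decision blocks 1015 and 1020 lead, so the sketch assumes they fall through to the speedup comparison of block 1030.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Candidate(Enum):
    NO_COMPRESSION = "block 1040"
    REPLICATED = "block 1050"
    SINGLE_SOCKET = "block 1060"
    INTERLEAVED = "block 1070"


@dataclass
class Profile:
    # Hypothetical field names; the values would come from the programmer,
    # the compiler, or workload measurements.
    memory_bound: bool
    read_fraction: float
    random_access_fraction: float
    fits_compressed_replicas: bool
    read_only: bool
    linear_accesses_per_element: float
    local_speedup: float
    remote_slowdown: float


def select_compressed_candidate(p, read_threshold=0.9, random_threshold=0.1,
                                min_linear_accesses=2.0):
    """Walk decision blocks 1000-1030; thresholds are assumed placeholders."""
    if not p.memory_bound:                            # block 1000
        return Candidate.NO_COMPRESSION
    if p.read_fraction < read_threshold:              # block 1005
        return Candidate.NO_COMPRESSION
    if p.random_access_fraction > random_threshold:   # block 1010
        return Candidate.NO_COMPRESSION
    if (p.fits_compressed_replicas                    # block 1015
            and p.read_only                           # block 1020
            and p.linear_accesses_per_element >= min_linear_accesses):  # 1025
        return Candidate.REPLICATED
    # Assumption: every negative output of 1015, 1020 and 1025 reaches 1030.
    if p.local_speedup > p.remote_slowdown:           # block 1030
        return Candidate.SINGLE_SOCKET
    return Candidate.INTERLEAVED


base = Profile(True, 0.95, 0.01, True, True, 4.0, 1.5, 1.2)
print(select_compressed_candidate(base).name)
print(select_compressed_candidate(replace(base, read_only=False)).name)
```

The thresholds are keyword arguments because the text describes them as predetermined or configurable and, in some embodiments, machine-specific.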

Returning to FIG. 8, the system may be configured to select between the uncompressed configuration candidate and the compressed configuration candidate, as in block 840. For example, after selecting placement candidates as described above, the system may be configured to determine whether to use the selected candidate with or without compression. In some embodiments, a first step in determining whether to use compression may be to add to the profile of the compression candidate the additional computation effort ($exec_{compressed}$) required to perform the compression. To determine this, the system may also need to know, in addition to the current compute rate ($exec_{current}$), the number of accesses per second ($\#accesses$) as well as the cost per access resulting from the extra CPU load that needs to be executed ($cost$). In some embodiments, the cost of decompression may vary with the compression ratio, since the number of values that can be extracted per instruction changes. Thus, in one embodiment, the computation effort required to perform the compression may be determined according to the following formula:

$$exec_{compressed} = exec_{current} + \#accesses \cdot cost$$

The reduction in bandwidth may also be calculated in a similar fashion, using the ratio $r \in [0, 1]$ of the compressed size to the uncompressed size of the elements ($elemsize$), as below:

$$bw_{compressed} = bw_{current} - \#accesses \cdot (1 - r) \cdot elemsize$$
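As a worked illustration of the two estimates above, the following Python sketch evaluates both formulas for one made-up profile. It assumes that the bandwidth formula subtracts the saved bytes from the current memory bandwidth (the text calls it a reduction in bandwidth) and that r is the compressed size divided by the uncompressed size. The function names, the units and the sample numbers are illustrative assumptions, not values from the disclosure.

```python
def exec_compressed(exec_current, accesses_per_sec, cost_per_access):
    """Compute rate needed once each access also pays a decompression cost."""
    return exec_current + accesses_per_sec * cost_per_access


def bw_compressed(bw_current, accesses_per_sec, r, elem_size):
    """Memory bandwidth after compression.

    r is assumed to be compressed size / uncompressed size, so every access
    saves (1 - r) * elem_size bytes of memory traffic.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("compression ratio r must lie in [0, 1]")
    return bw_current - accesses_per_sec * (1.0 - r) * elem_size


# Hypothetical profile: 2e9 instructions/s, 1e8 accesses/s, 3 extra
# instructions per access, 8e8 bytes/s of traffic, 8-byte elements
# compressed to half their size.
print(exec_compressed(2.0e9, 1.0e8, 3.0))     # 2300000000.0
print(bw_compressed(8.0e8, 1.0e8, 0.5, 8.0))  # 400000000.0
```

In this example compression raises the compute demand by 15% while halving the memory traffic, which is the trade-off the subsequent speedup estimate weighs.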

Using computed values, as discussed above, for the compressed case and the measured values for the uncompressed case, the system may estimate each placement's speedup. For instance, the system may be configured to compute, for each placement, the ratio of the maximum compute rate relative to the current rate. Thus, the system may obtain each candidate's speedup if the workload is not memory-bound. Next, for each socket, the system may compute the ratio of the maximum memory bandwidth for each candidate placement relative to the current bandwidth. This gives the socket speedup assuming the workload is not compute-bound. Finally, for each socket, the system may take the minimum of these two ratios as the socket's estimated speedup and average these per-socket values to obtain the configuration's estimated speedup. The system may then choose the configuration predicted to be the fastest, according to some embodiments.
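The estimate described above can be sketched as follows. This is one hedged reading of the paragraph, assuming a single compute ratio per placement and one bandwidth ratio per socket. All names and sample numbers are hypothetical.

```python
from statistics import mean


def estimate_speedup(max_exec, exec_rate, max_bw_per_socket, bw_per_socket):
    """Average, over sockets, of min(compute ratio, bandwidth ratio)."""
    compute_ratio = max_exec / exec_rate          # speedup if not memory-bound
    socket_speedups = [
        min(compute_ratio, max_bw / bw)           # bandwidth ratio per socket
        for max_bw, bw in zip(max_bw_per_socket, bw_per_socket)
    ]
    return mean(socket_speedups)


def choose_configuration(candidates):
    """Return the name of the candidate with the highest estimated speedup."""
    return max(candidates, key=lambda name: estimate_speedup(**candidates[name]))


# Two hypothetical candidates on a two-socket machine.
candidates = {
    "uncompressed": dict(max_exec=100.0, exec_rate=50.0,
                         max_bw_per_socket=[40.0, 40.0],
                         bw_per_socket=[40.0, 10.0]),
    "compressed": dict(max_exec=100.0, exec_rate=60.0,
                       max_bw_per_socket=[40.0, 40.0],
                       bw_per_socket=[20.0, 5.0]),
}
print(choose_configuration(candidates))
```

Taking the minimum per socket reflects that a socket can speed up only as far as its tighter resource (compute or memory bandwidth) allows.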

As noted above, adaptive data collections and/or corresponding adaptive data functionalities may be implemented within a runtime system that supports parallel loops with dynamic distribution of loop iterations between worker threads. In some embodiments, adaptive data collections and/or corresponding adaptive data functionalities may be developed in a given programming language, such as C++, regardless of what language(s) may be used to access the data collections. This approach may be considered to provide a number of potential advantages, such as: (i) in C++ the memory layout of the adaptive data collections may be controlled by interfacing with the operating system (OS), such as by making system calls for NUMA-aware data placement, (ii) by careful design of the cross-language (e.g., Java to C++) interface, the runtime cross-language compiler may be used to inline the implementation into other languages and thereby to potentially optimize it alongside user code, and (iii) by having a single implementation, re-implementing functionality for multiple languages may be avoided while still enabling multi-language workloads, according to various embodiments. However, the particular example advantages mentioned above may or may not be achieved in any given implementation of adaptive data collections.

In some embodiments, a thin API may be provided, such as to hide the runtime cross-language compiler's API calls to the entry points of a unified API. FIG. 11 is a logical block diagram illustrating one example of an adaptive data collection accessible via the Java programming language, according to one embodiment. FIG. 11 shows conceptually how, according to one example embodiment, the C++ implementation may be exposed to Java.

In addition, FIG. 11 depicts three different interoperability paths 1150, 1152 and 1154 via which the native world of the runtime system 1120 may interact with the managed world of Java. The first interoperability path 1150 may, in some embodiments, be considered central to the efficient interoperability between C++ implemented adaptive data collections and Java. This may be the fastest interoperability path, and may be made available by the runtime cross-language compiler 130 to enable access to adaptive data collections for any supported guest language, including Java. This path may leverage the ability of the runtime cross-language compiler to optimize and/or compile the adaptive data functionalities (e.g., implemented in C++). For example, the LLVM bitcode 1135 of Entrypoints.cpp may be optimized and/or compiled with the code 1130.

Additionally, one or more entry point functions may be exposed via a unified API of the adaptive data collections. The entry points may be compiled into bitcode (e.g., LLVM bitcode), which the runtime cross-language compiler 130 may execute. Additionally, in some embodiments, these entry points may be seamlessly used by guest languages running on top of the runtime cross-language compiler 130.

In some embodiments, a per-language thin API layer 1130 that mirrors the unified API may be provided. For instance, one example is shown in FIG. 11 for the case of Java, according to one example implementation. One purpose for using per-language thin API layer 1130 may be to hide the API of the runtime cross-language compiler 130 and make accessing the entry points more convenient. Note that no adaptive functionality may be re-implemented in Java. Instead, the SmartArray::get( ) function illustrated in FIG. 11 may incorporate the C++ logic for potential replicas and/or bit decompression. The function may be exposed as an entry point that is compiled into bitcode (e.g., LLVM bitcode) and executed by the runtime cross-language compiler 130. The runtime cross-language compiler 130 may then execute user Java code and the Java thin API (including the adaptive data collection functionality), and dynamically optimize and compile the multi-language application (e.g., the application accessing the adaptive data collection).
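The internals of SmartArray::get( ) are not spelled out here. As a hypothetical illustration of what replica selection followed by bit decompression might involve, the following self-contained Python sketch packs fixed-width values into 64-bit words, keeps one copy of the packed words per replica, and extracts a value on access. The class name, the `socket` parameter and the packing layout are assumptions made for this example. They are not the C++ implementation or its API.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


class BitPackedArray:
    """Illustrative stand-in for a bit-compressed, replicated array."""

    def __init__(self, values, bits, replicas=1):
        if not 1 <= bits <= WORD_BITS:
            raise ValueError("bits must be between 1 and 64")
        self.bits = bits
        self.length = len(values)
        value_mask = (1 << bits) - 1
        words = [0] * ((self.length * bits + WORD_BITS - 1) // WORD_BITS)
        for i, v in enumerate(values):
            if not 0 <= v <= value_mask:
                raise ValueError("value does not fit in the given bit width")
            w, off = divmod(i * bits, WORD_BITS)
            words[w] |= (v << off) & WORD_MASK
            if off + bits > WORD_BITS:            # value straddles two words
                words[w + 1] |= v >> (WORD_BITS - off)
        # One copy per replica, standing in for per-socket placement.
        self.replicas = [list(words) for _ in range(replicas)]

    def get(self, index, socket=0):
        if not 0 <= index < self.length:
            raise IndexError(index)
        words = self.replicas[socket % len(self.replicas)]  # pick a replica
        w, off = divmod(index * self.bits, WORD_BITS)
        value = words[w] >> off
        if off + self.bits > WORD_BITS:           # pull in the spilled bits
            value |= words[w + 1] << (WORD_BITS - off)
        return value & ((1 << self.bits) - 1)


data = [5, 1023, 0, 77, 512, 1, 999]
arr = BitPackedArray(data, bits=10, replicas=2)
print([arr.get(i, socket=i) for i in range(len(data))])
```

A thin per-language wrapper would forward calls to such a `get` entry point without duplicating this logic, which is the arrangement the paragraph above describes for Java.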

Two additional interoperability paths may be used for accessing components used by adaptive data collections. For instance, interoperability path 1152 may be via JNI and unsafe (e.g., code that is generally unsafe, but sometimes required, especially within low level code) methods 1160. This path may exist for any Java application; however, JNI may be slow for array accesses and unsafe may not be interoperable, according to some embodiments. Thus, interoperability path 1152 may be used to access the runtime system's native functionality for parallel loop scheduling, in some embodiments. The third interoperability path 1154 may be the runtime cross-language compiler's native function interface (NFI) capability for the runtime compiled code to call into precompiled native libraries, such as the native library 1165 of the runtime system. In some embodiments, this may be the slowest path since NFI, similar to JNI, may need both pre- and post-processing.

The systems, techniques, methods and/or mechanisms described herein for implementing adaptable data collections may be applicable to any application/system that uses data collections, and specifically arrays, for storing and processing data (e.g., database management systems such as SAP HANA, MS SQL Server, etc., as well as graph processing systems such as Oracle PGX, Oracle Database, and Neo4j, among others). These systems may be configured to implement and employ adaptable data collections, such as to exploit the language-independent adaptive optimizations described herein for NUMA-awareness and bit compression.

Example Computing System

The techniques and methods described herein for Detection, Modeling and Application of Memory Bandwidth Patterns may be implemented on or by any of a variety of computing systems, in different embodiments. For example, FIG. 12 is a block diagram illustrating one embodiment of a computing system that is configured to implement such techniques and methods, as described herein, according to various embodiments. The computer system 1200 may be any of various types of devices, including, but not limited to, a personal computer system, desktop computer, laptop or notebook computer, mainframe computer system, handheld computer, workstation, network computer, a consumer device, application server, storage device, a peripheral device such as a switch, modem, router, etc., or in general any type of computing device.

Some of the mechanisms for Detection, Modelling and Prediction of Memory Access Patterns, as described herein, may be provided as a computer program product, or software, that may include a non-transitory, computer-readable storage medium having stored thereon instructions, which may be used to program a computer system 1200 (or other electronic devices) to perform a process according to various embodiments. A computer-readable storage medium may include any mechanism for storing information in a form (e.g., software, processing application) readable by a machine (e.g., a computer). The machine-readable storage medium may include, but is not limited to, magnetic storage medium (e.g., floppy diskette); optical storage medium (e.g., CD-ROM); magneto-optical storage medium; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; electrical, or other types of medium suitable for storing program instructions. In addition, program instructions may be communicated using optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.).

In various embodiments, computer system 1200 may include one or more processors 1270; each may include multiple cores, any of which may be single- or multi-threaded. For example, multiple processor cores may be included in a single processor chip (e.g., a single processor 1270), and multiple processor chips may be included in computer system 1200. Each of the processors 1270 may include a cache or a hierarchy of caches 1275, in various embodiments. For example, each processor chip 1270 may include multiple L1 caches (e.g., one per processor core) and one or more other caches (which may be shared by the processor cores on a single processor). The computer system 1200 may also include one or more storage devices 1250 (e.g., optical storage, magnetic storage, hard drive, tape drive, solid state memory, etc.) and one or more system memories 1210 (e.g., one or more of cache, SRAM, DRAM, RDRAM, EDO RAM, DDR 10 RAM, SDRAM, Rambus RAM, EEPROM, etc.). In some embodiments, one or more of the storage device(s) 1250 may be implemented as a module on a memory bus (e.g., on interconnect 1240) that is similar in form and/or function to a single in-line memory module (SIMM) or to a dual in-line memory module (DIMM). Various embodiments may include fewer or additional components not illustrated in FIG. 12 (e.g., video cards, audio cards, additional network interfaces, peripheral devices, a network interface such as an ATM interface, an Ethernet interface, a Frame Relay interface, etc.).

The one or more processors 1270, the storage device(s) 1250, and the system memory 1210 may be coupled to the system interconnect 1240. One or more of the system memories 1210 may contain program instructions 1220. Program instructions 1220 may be executable to implement runtime cross-language compiler 130, adaptive data collection(s) 110, language specific interface(s) 120, and/or application(s) 140, as well as other programs/components configured to implement one or more of the systems, methods and/or techniques described herein.

Program instructions 1220 may be encoded in platform native binary, any interpreted language such as Java™ byte-code, or in any other language such as C/C++, the Java™ programming language, etc., or in any combination thereof. In various embodiments, runtime cross-language compiler 130, adaptive data collection(s) 110, language specific interface(s) 120, and/or application(s) 140 may each be implemented in any of various programming languages or methods. For example, in one embodiment, runtime cross-language compiler 130, adaptive data collection(s) 110, language specific interface(s) 120, and/or application(s) 140 may be based on the Java programming language, while in other embodiments they may be written using the C or C++ programming languages. Moreover, in some embodiments, runtime cross-language compiler 130, adaptive data collection(s) 110, language specific interface(s) 120, and/or application(s) 140 may not be implemented using the same programming language.

Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. For example, although many of the embodiments are described in terms of particular types of operations that support synchronization within multi-threaded applications that access particular shared resources, it should be noted that the techniques and mechanisms disclosed herein for accessing and/or operating on shared resources may be applicable in other contexts in which applications access and/or operate on different types of shared resources than those described in the examples herein and in which different embodiments of the underlying hardware that supports persistent memory transactions described herein are supported or implemented. It is intended that the following claims be interpreted to embrace all such variations and modifications.

What is claimed is:
1. A method, comprising: performing by a computer including multiple sockets each including multiple processor cores: executing a workload within a platform independent virtual environment of the computer by an application developed using a first programming language, wherein said executing comprises: optimizing, at runtime, by a runtime cross-language compiler, at least a portion of the application and one or more dynamically selected data functionalities developed using a second programming language and configured to provide a language specific interface to access to a data collection storing data in a memory of the computer; and accessing, by the application via the one or more dynamically selected data functionalities, the data collection, wherein said accessing the data collection comprises the application executing one or more methods of a data access application programming interface (API) configured to access the data collection via the language specific interface.
2. The method of claim 1, wherein said optimizing comprises: compiling, by the runtime cross-language compiler, at least a portion of the application and the one or more dynamically selected data functionalities.
3. The method of claim 1, further comprising: calling, by at least one of the one or more dynamically selected data functionalities using the second programming language, one or more system calls of the platform independent virtual environment, wherein the one or more system calls provide NUMA-aware data placement of data for the data collection.
4. The method of claim 1, wherein said optimizing comprises: inlining, by the runtime cross-language compiler, one or more portions of the one or more dynamically selected data functionalities using the first programming language.
5. The method of claim 1, further comprising: providing a thin API developed using the first programming language to the application, wherein the thin API mirrors, and provides an entry point to, at least one function of the data access API; and wherein said optimizing comprises optimizing the thin API with the at least a portion of the application.
6. The method of claim 1, further comprising: selecting at runtime, based at least in part on one or more predicted resource requirements of the workload, a first configuration specifying the one or more dynamically selected data functionalities; and configuring the data collection according to the first configuration, wherein after said configuring, the data of the data collection is accessible according to the dynamically selected data functionalities as specified by the first configuration.
7. The method of claim 1, further comprising: executing a second workload within the platform independent virtual environment of the computer by a second application developed using a third programming language, wherein said executing the second workload comprises: optimizing, at runtime, by the runtime cross-language compiler, at least a portion of the second application and one or more of the dynamically selected data functionalities; and accessing, by the second application via the one or more dynamically selected data functionalities, the data collection.
8. The method of claim 1, wherein the one or more dynamically selected data functionalities comprise one or more of: an operating system default NUMA-aware data placement for data of the data collection; a single socket NUMA-aware data placement for data of the data collection; an interleaved NUMA-aware data placement for data of the data collection; a replicated NUMA-aware data placement for data of the data collection; a compression scheme for data of the data collection; an indexing scheme for data elements of the data collection; and a data synchronization scheme for the data collection.
9. The method of claim 1, wherein the data collection is configured to organize the data of the data collection as one of a bag, a set, an array, or a map.
10. A system, comprising: a computing device comprising multiple sockets, each comprising multiple processor cores; and a memory coupled to the computing device comprising program instructions executable by the computing device to implement a platform independent virtual environment configured to: execute a workload within the platform independent virtual environment of the computer by an application developed using a first programming language, wherein said executing comprises: optimize, at runtime, by a runtime cross-language compiler, at least a portion of the application and one or more dynamically selected data functionalities developed using a second programming language and configured to provide a language specific interface to access to a data collection storing data in the memory of the computing device; and access, by the application via the one or more dynamically selected data functionalities, the data collection, wherein said accessing the data collection comprises the application executing one or more methods of a data access application programming interface (API) configured to access the data collection via the language specific interface.
11. The system of claim 10, wherein the platform independent virtual environment is further configured to: compile, by the runtime cross-language compiler, at least a portion of the application and the one or more dynamically selected data functionalities.
12. The system of claim 10, wherein the platform independent virtual environment is further configured to: call, by at least one of the one or more dynamically selected data functionalities using the second programming language, one or more system calls of the platform independent virtual environment; wherein the one or more system calls provide NUMA-aware data placement of data for the data collection.
13. The system of claim 10, wherein the platform independent virtual environment is further configured to: inline, by the runtime cross-language compiler, one or more portions of the one or more dynamically selected data functionalities using the first programming language.
14. The system of claim 10, wherein the platform independent virtual environment is further configured to: provide a thin API developed using the first programming language to the application, wherein the thin API mirrors, and provides an entry point to, at least one function of the data access API; wherein to optimize the at least a portion of the application, the program instructions are further executable to optimize the thin API with the at least a portion of the application.
15. The system of claim 10, wherein the platform independent virtual environment is further configured to: select at runtime, based at least in part on one or more predicted resource requirements of the workload, a first configuration specifying the one or more dynamically selected data functionalities; and configure the data collection according to the first configuration, wherein after said configuring, the data of the data collection is accessible according to the one or more dynamically selected data functionalities as specified by the first configuration.
16. The system of claim 10, wherein the one or more dynamically selected data functionalities comprise one or more of: an operating system default NUMA-aware data placement for data of the data collection; a single socket NUMA-aware data placement for data of the data collection; an interleaved NUMA-aware data placement for data of the data collection; a replicated NUMA-aware data placement for data of the data collection; a compression scheme for data of the data collection; an indexing scheme for data elements of the data collection; and a data synchronization scheme for the data collection.
17. One or more non-transitory computer readable storage media storing program instructions executable on or across one or more processors cause the one or more processors to implement: executing a workload within a platform independent virtual environment of the computer by an application developed using a first programming language, wherein said executing comprises: optimizing, at runtime, by a runtime cross-language compiler, at least a portion of the application and one or more dynamically selected data functionalities developed using a second programming language and configured to provide a language specific interface to access to a data collection; and accessing, by the application via the one or more dynamically selected data functionalities, the data collection, wherein said accessing the data collection comprises the application executing one or more methods of a data access application programming interface (API) configured to access the data collection via the language specific interface.
18. The one or more non-transitory computer readable storage media as recited in claim 17, storing further program instructions executable on or across the one or more processors cause the one or more processors to implement: compiling, by the runtime cross-language compiler, at least a portion of the application and the one or more dynamically selected data functionalities.
19. The one or more non-transitory computer readable storage media as recited in claim 17, storing further program instructions executable on or across the one or more processors cause the one or more processors to implement: selecting at runtime, based at least in part on one or more predicted resource requirements of the workload, a first configuration specifying the one or more dynamically selected data functionalities; and configuring the data collection according to the first configuration, wherein after said configuring, the data of the data collection is accessible according to the dynamically selected data functionalities as specified by the first configuration.
20. The one or more non-transitory computer readable storage media as recited in claim 17, storing further program instructions executable on or across the one or more processors cause the one or more processors to implement: calling, by at least one of the one or more dynamically selected data functionalities using the second programming language, one or more system calls of the platform independent virtual environment; wherein the one or more system calls provide NUMA-aware data placement of data for the data collection.